In [5]:
import pandas as pd
import networkx as nx
import os
import numpy as np
Networks can be represented in tabular form in two ways: as an adjacency list with edge attributes stored as columnar values, and as a node list with node attributes stored as columnar values.
Storing the network data as a single massive adjacency table, with node attributes repeated on each row, can get unwieldy, especially if the graph is large or grows to be so. One way around this is to store two files: one with node data and node attributes, and one with edge data and edge attributes, as sketched below.
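For concreteness, here is a toy sketch of that two-file layout; the station names and trip counts are made up for illustration, not taken from the real data.
In [ ]:
# Node list: one row per station, with its attributes as columns.
toy_nodes = pd.DataFrame({'id': [1, 2, 3],
                          'name': ['State St', 'Clark St', 'Michigan Ave']}).set_index('id')

# Edge list: one row per connection, with its attributes as columns.
toy_edges = pd.DataFrame({'from_id': [1, 2], 'to_id': [2, 3], 'count': [5, 2]})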
The Divvy bike sharing data set is one example of network data stored this way. Let's use it as a starting point. The data set comprises two files: Divvy_Stations_2013.csv, which holds the stations and their attributes, and Divvy_Trips_2013.csv, which holds the trips and their attributes.
The README.txt file in the Divvy directory should help orient you around the data.
In [6]:
stations = pd.read_csv('datasets/divvy_2013/Divvy_Stations_2013.csv', parse_dates=['online date'], index_col='id')
stations
Out[6]:
In [7]:
trips = pd.read_csv('datasets/divvy_2013/Divvy_Trips_2013.csv', parse_dates=['starttime', 'stoptime'], index_col=['trip_id'])
trips = trips.sort_index()  # sort by the trip_id index; DataFrame.sort() no longer exists in pandas
trips
Out[7]:
At this point, we have our stations and trips data loaded into memory.
How we construct the graph depends on the kind of questions we want to answer, which makes the definition of the "unit of consideration" (the entities whose relationships we are trying to model) extremely important.
Let's try to answer the question: "What are the most popular trip paths?" In this case, the bike station is a reasonable "unit of consideration", so we will use the bike stations as the nodes.
To start, let's initialize a directed graph G.
In [8]:
G = nx.DiGraph()
Then, let's iterate over the stations DataFrame, and add in the node attributes.
In [9]:
for r, d in stations.iterrows():  # the pandas DataFrame row-by-row iterator
    G.add_node(r, **d.to_dict())  # unpack the row into node attributes
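As a quick sanity check, we can look up the attribute dictionary of any one station, e.g. the first station id in the index:
In [ ]:
G.nodes[stations.index[0]]  # attributes of the first station in the table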
In order to answer the question of "which stations are important", we need to specify things a bit more. Perhaps a measure such as betweenness centrality or degree centrality may be appropriate here.
The naive way to build the edges would be to iterate over all the trip rows. Go ahead and try it at your own risk - it may take a long time :-). Alternatively, I would suggest doing a pandas groupby.
In [ ]:
# # Run the following code at your own risk :)
# for r, d in trips.iterrows():
#     start = d['from_station_id']
#     end = d['to_station_id']
#     if not G.has_edge(start, end):
#         G.add_edge(start, end, count=1)
#     else:
#         G.edges[start, end]['count'] += 1
In [ ]:
for (start, stop), d in trips.groupby(['from_station_id', 'to_station_id']):
    G.add_edge(start, stop, count=len(d))  # one edge per station pair, weighted by the number of trips
First off, let's figure out how dense the graph is. A graph's density is the number of edges divided by the number of possible edges; for a directed graph with n nodes, that is m / (n * (n - 1)).
NetworkX provides an implementation of graph density, but it assumes that self-loops are not allowed. (Self-loops are edges from a node to itself.) Before computing the density, let's take a look at the edges and their attributes.
In [ ]:
G.edges(data=True)
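Now for the density itself. A minimal sketch, computing it by hand alongside NetworkX's built-in:
In [ ]:
n = G.number_of_nodes()
m = G.number_of_edges()
print(m / (n * (n - 1)))  # directed density, computed by hand
print(nx.density(G))      # NetworkX's built-in uses the same formula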
Applying what we learned earlier on, let's use the betweenness centrality metric.
In [ ]:
centralities = nx.betweenness_centrality(G, weight='count')
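One caveat: betweenness_centrality interprets the weight attribute as a distance when finding shortest paths, so weighting by raw trip counts makes heavily-used connections look far apart. Here is a hedged sketch of the alternative, inverting the counts so that popular connections count as short; inv_count is an attribute name introduced purely for this illustration.
In [ ]:
# Invert the counts so that heavily-used connections act as short distances.
# 'inv_count' is a hypothetical attribute name, not part of the data set.
for start, stop, d in G.edges(data=True):
    d['inv_count'] = 1 / d['count']
centralities_inv = nx.betweenness_centrality(G, weight='inv_count')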
In [ ]:
sorted(centralities.items(), key=lambda x:x[1], reverse=True)
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.bar(list(centralities.keys()), list(centralities.values()))  # convert the dict views to lists for plotting
Applying what we learned earlier, let's use the "degree centrality" metric as well.
In [ ]:
decentrality = nx.degree_centrality(G)
plt.bar(list(decentrality.keys()), list(decentrality.values()))
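As with betweenness centrality, we can rank the stations; a small sketch listing the ten highest-degree stations:
In [ ]:
sorted(decentrality.items(), key=lambda x: x[1], reverse=True)[:10]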
The code above should have demonstrated to you the basic logic behind storing graph data in a human-readable format. For the richest data format, you can store a node list with attributes, and an edge list (a.k.a. adjacency list) with attributes.
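A minimal sketch of writing those two tables back out as CSV files; the output file names here are made up for illustration:
In [ ]:
# Node list with attributes, one row per station.
node_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
node_df.to_csv('datasets/divvy_2013/divvy_nodes.csv')

# Edge list with attributes, one row per station pair.
edge_df = nx.to_pandas_edgelist(G)
edge_df.to_csv('datasets/divvy_2013/divvy_edges.csv', index=False)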
NetworkX's API offers many formats for storing graphs to disk. If you intend to work exclusively with NetworkX, then pickling the file to disk is probably the easiest way.
To write to disk:
nx.write_gpickle(G, handle)
To load from disk:
G = nx.read_gpickle(handle)
Let's write the graph to disk so that we can analyze it further in other notebooks.
In [ ]:
nx.write_gpickle(G, 'datasets/divvy_2013/divvy_graph.pkl')